class: center, middle, inverse, title-slide

# Analytical Paleobiology
## with the Tidyverse
### Gregor H. Mathes
### University of Bayreuth / Paleoecology Lectures
### 2021/06/18 (updated: 2021-06-16)

---
class: center

# The Tidyverse

.center[]

---
class: inverse, center

# The Tidyverse

### - collection of <span style = 'color:#E69F00'>easy-to-use</span> tools for data analysis and visualization

### - <span style = 'color:#E69F00'>consistent</span> in both syntax and output

### - <span style = 'color:#E69F00'>widely used</span> in industry and in science

<br>

.center[<img src="https://peadarcoyle.com/wp-content/uploads/2019/01/hadley-wickham.jpg" alt="Picture of Hadley Wickham" width="500"/>
]

---
background-image: url(https://raw.githubusercontent.com/tidyverse/tidyverse/master/man/figures/logo.png)
background-position: 90% 10%

## `library(tidyverse)` will load
## the core tidyverse packages:

#### [ggplot2](http://ggplot2.tidyverse.org), for data visualisation.

#### [dplyr](http://dplyr.tidyverse.org), for data manipulation.

#### [tidyr](http://tidyr.tidyverse.org), for data tidying.

#### [readr](http://readr.tidyverse.org), for data import.

#### [purrr](http://purrr.tidyverse.org), for functional programming.

#### [tibble](http://tibble.tidyverse.org), for tibbles, a modern re-imagining of data frames.

#### [stringr](https://github.com/tidyverse/stringr), for strings.

#### [forcats](https://github.com/hadley/forcats), for factors.
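---

# Loading the tidyverse

A minimal sketch, not part of the original slides, of how one might check that a single `library(tidyverse)` call attaches the core packages listed before. The vector name `core` is only an illustration.

```r
# Attach the core tidyverse packages in one call
library(tidyverse)

# The eight core packages named on the previous slide
# (`core` is a hypothetical helper vector, not a tidyverse object)
core <- c("ggplot2", "dplyr", "tidyr", "readr",
          "purrr", "tibble", "stringr", "forcats")

# TRUE for every package that is now attached to the search path
core %in% .packages()
```

Loading a single package, e.g. `library(dplyr)`, works just as well when only one tool is needed.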
---

# Agenda

## - read in data with the `readr` package
<br>
## - wrangle data with the `dplyr` package
<br>
## - visualise data with the `ggplot2` package

---
class:inverse, mline, center, middle

# The *readr* package

---

# readr

Function       | Reads
-------------- | --------------------------
`read_csv()`   | Comma separated values
`read_csv2()`  | Semicolon separated values
`read_delim()` | General delimited files
`read_fwf()`   | Fixed width files
`read_log()`   | Apache log files
`read_table()` | Space separated files
`read_tsv()`   | Tab delimited values

<br>

.center[and many more ...]

---

# readr

```r
dfr <- read_csv("file_name.csv")
```

<br>

--

<html> <div style='float:left'></div> <hr color='#EB811B' size=1px width=720px> </html>

<br>

```r
dfr <- read_csv(here("figures/file_name.csv"))
```

<br>

--

<html> <div style='float:left'></div> <hr color='#EB811B' size=1px width=720px> </html>

<br>

```r
url <- 'https://paleobiodb.org/data1.2/occs/list.txt?base_name=Carnivora&show=full'
dfr <- read_csv(file = url)
```

---

# readr

```r
carnivores <- read_csv(file = here("2021_06_18/carnivores.csv"))
carnivores
```

```
## # A tibble: 12,168 x 118
##    occurrence_no record_type reid_no flags collection_no
##            <dbl> <chr>         <dbl> <lgl>         <dbl>
##  1        117266 occ              NA NA             9070
##  2        137493 occ              NA NA            11601
##  3        137495 occ              NA NA            11601
##  4        138737 occ              NA NA            11798
##  5        138738 occ              NA NA            11798
##  6        138739 occ              NA NA            11798
##  7        138740 occ              NA NA            11798
##  8        138741 occ              NA NA            11798
##  9        138742 occ              NA NA            11798
## 10        138743 occ              NA NA            11798
## # … with 12,158 more rows, and 113 more variables:
## #   identified_name <chr>, identified_rank <chr>,
## #   identified_no <dbl>, difference <chr>,
## #   accepted_name <chr>, accepted_attr <lgl>, …
```

---

# Tibbles

## <span style = 'color:#E69F00'>data.frames</span> are the basic form of rectangular data in R (columns of variables, rows of observations)

---

# Tibbles

## <span style = 'color:#E5E5E5'>data.frames are the basic form of rectangular data in R (columns of
variables, rows of observations)</span>

## `read_csv()` reads the data into a <span style = 'color:#E69F00'>tibble</span>, a modern version of the data frame

---

# Tibbles

## <span style = 'color:#E5E5E5'>data.frames are the basic form of rectangular data in R (columns of variables, rows of observations)</span>

## <span style = 'color:#E5E5E5'>read_csv() reads the data into a tibble, a modern version of the data frame.</span>

## a tibble <span style = 'color:#E69F00'>is</span> a data frame

---

# Saving data

Function            | Writes
------------------- | ----------------------------------------
`write_csv()`       | Comma separated values
`write_excel_csv()` | CSV that you plan to open in Excel
`write_delim()`     | General delimited files
`write_file()`      | A single string, written as is
`write_lines()`     | A vector of strings, one string per line
`write_tsv()`       | Tab delimited values
`write_rds()`       | A data type used by R to save objects
`write_sas()`       | SAS .sas7bdat files (haven)
`write_xpt()`       | SAS transport format, .xpt (haven)
`write_sav()`       | SPSS .sav files (haven)
`write_dta()`       | Stata .dta files (haven)

.center[ and many more...
]

---
class:inverse, mline, center, middle

# The *dplyr* package

---

# The main verbs of *dplyr*

## `select()`

## `filter()`

## `mutate()`

## `arrange()`

## `summarize()`

## `group_by()`

---

# The main verbs of *dplyr*

## <span style = 'color:#E69F00'><code>select()</code></span> = <span style = 'color:#56B4E9'>Subset columns (variables)</span>

## `filter()`

## `mutate()`

## `arrange()`

## `summarize()`

## `group_by()`

---

# dplyr

## `select()`

```r
select(<DATA>, <VARIABLES>)
```

---

# dplyr

## `select()`

```r
select(<DATA>, <VARIABLES>)
```

```r
carnivores
```

```
## # A tibble: 12,168 x 118
##    occurrence_no record_type reid_no flags collection_no
##            <dbl> <chr>         <dbl> <lgl>         <dbl>
##  1        117266 occ              NA NA             9070
##  2        137493 occ              NA NA            11601
##  3        137495 occ              NA NA            11601
##  4        138737 occ              NA NA            11798
##  5        138738 occ              NA NA            11798
##  6        138739 occ              NA NA            11798
##  7        138740 occ              NA NA            11798
##  8        138741 occ              NA NA            11798
##  9        138742 occ              NA NA            11798
## 10        138743 occ              NA NA            11798
## # … with 12,158 more rows, and 113 more variables:
## #   identified_name <chr>, identified_rank <chr>,
## #   identified_no <dbl>, difference <chr>,
## #   accepted_name <chr>, accepted_attr <lgl>, …
```

---

# dplyr

## `select()`

```r
select(carnivores, identified_rank, accepted_name, min_ma, max_ma)
```

```
## # A tibble: 12,168 x 4
##    identified_rank accepted_name         min_ma max_ma
##    <chr>           <chr>                  <dbl>  <dbl>
##  1 species         Cynodictis lacustris   33.9    37.2
##  2 species         Enaliarctos mealsi     23.0    28.1
##  3 species         Pinnarctidion bishopi  23.0    28.1
##  4 species         Indarctos atticus       7.25   11.6
##  5 genus           Protursus               7.25   11.6
##  6 subfamily       Ursinae                 7.25   11.6
##  7 species         Proputorius             7.25   11.6
##  8 species         Sivaonyx bathygnathus   7.25   11.6
##  9 genus           Lutra                   7.25   11.6
## 10 species         Ictitherium             7.25   11.6
## # … with 12,158 more rows
```

---

# dplyr

## `select()`

```r
select(carnivores, occurrence_no, record_type, reid_no, flags)
select(carnivores, occurrence_no:flags)
select(carnivores, 1:4)
select(carnivores, starts_with("c"))
?select_helpers
```

---

# The main verbs of *dplyr*

## `select()`

## <span style = 'color:#E69F00'><code>filter()</code></span> = <span style = 'color:#56B4E9'>Subset rows by value</span>

## `mutate()`

## `arrange()`

## `summarize()`

## `group_by()`

---

# dplyr

## `filter()`

```r
filter(<DATA>, <PREDICATES>)
```

### Predicates: `TRUE/FALSE` statements

--

### Comparisons: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

--

### Operators: `&` is "and", `|` is "or", and `!` is "not"

--

### `%in%` tests whether a value occurs in a vector

---

# dplyr

## `filter()`

```r
filter(carnivores, accepted_rank == "species", max_ma > 7)
```

```
## # A tibble: 3,235 x 118
##    occurrence_no record_type reid_no flags collection_no
##            <dbl> <chr>         <dbl> <lgl>         <dbl>
##  1        117266 occ              NA NA             9070
##  2        137493 occ              NA NA            11601
##  3        137495 occ              NA NA            11601
##  4        138737 occ              NA NA            11798
##  5        138741 occ              NA NA            11798
##  6        147936 occ              NA NA            13061
##  7        149438 occ              NA NA            13192
##  8        164284 occ              NA NA            14768
##  9        165362 occ              NA NA            14906
## 10        165688 occ              NA NA            14939
## # … with 3,225 more rows, and 113 more variables:
## #   identified_name <chr>, identified_rank <chr>,
## #   identified_no <dbl>, difference <chr>,
## #   accepted_name <chr>, accepted_attr <lgl>, …
```

---

# The main verbs of *dplyr*

## `select()`

## `filter()`

## <span style = 'color:#E69F00'><code>mutate()</code></span> = <span style = 'color:#56B4E9'>Change or add a variable</span>

## `arrange()`

## `summarize()`

## `group_by()`

---

# dplyr

## `mutate()`

```r
mutate(<DATA>, <NAME> = <FUNCTION>)
```

---

# dplyr

## `mutate()`

```r
mutate(carnivores, age_range = abs(max_ma - min_ma))
```

```
## # A tibble: 12,168 x 119
##    occurrence_no record_type reid_no flags collection_no
##            <dbl> <chr>         <dbl> <lgl>         <dbl>
##  1        117266 occ              NA NA             9070
##  2        137493 occ              NA NA            11601
##  3        137495 occ              NA NA            11601
##  4        138737 occ              NA NA            11798
##  5        138738 occ              NA NA            11798
##  6        138739 occ              NA NA            11798
##  7        138740 occ              NA NA            11798
##  8        138741 occ              NA NA            11798
##  9        138742 occ              NA NA            11798
## 10        138743 occ              NA NA            11798
## # … with 12,158
more rows, and 114 more variables:
## #   identified_name <chr>, identified_rank <chr>,
## #   identified_no <dbl>, difference <chr>,
## #   accepted_name <chr>, accepted_attr <lgl>, …
```

---

# dplyr

## `transmute()`

```r
transmute(carnivores,
          age_range = abs(max_ma - min_ma),
          age_range_sq = age_range^2)
```

```
## # A tibble: 12,168 x 2
##    age_range age_range_sq
##        <dbl>        <dbl>
##  1      3.30         10.9
##  2      5.07         25.7
##  3      5.07         25.7
##  4      4.37         19.1
##  5      4.37         19.1
##  6      4.37         19.1
##  7      4.37         19.1
##  8      4.37         19.1
##  9      4.37         19.1
## 10      4.37         19.1
## # … with 12,158 more rows
```

---

# The main verbs of *dplyr*

## `select()`

## `filter()`

## `mutate()`

## <span style = 'color:#E69F00'><code>arrange()</code></span> = <span style = 'color:#56B4E9'>Sort the data set</span>

## `summarize()`

## `group_by()`

---

# dplyr

## `arrange()`

```r
arrange(<DATA>, <SORTING VARIABLE>)
```

---

# dplyr

## `arrange()`

```r
arrange(carnivores, max_ma) %>%
  select(max_ma, everything())
```

```
## # A tibble: 12,168 x 118
##    max_ma occurrence_no record_type reid_no flags
##     <dbl>         <dbl> <chr>         <dbl> <lgl>
##  1 0.0117        154618 occ              NA NA
##  2 0.0117        212868 occ              NA NA
##  3 0.0117        212869 occ              NA NA
##  4 0.0117        212870 occ              NA NA
##  5 0.0117        212871 occ              NA NA
##  6 0.0117        212872 occ              NA NA
##  7 0.0117        212873 occ              NA NA
##  8 0.0117        212918 occ              NA NA
##  9 0.0117        212932 occ              NA NA
## 10 0.0117        212933 occ              NA NA
## # … with 12,158 more rows, and 113 more variables:
## #   collection_no <dbl>, identified_name <chr>,
## #   identified_rank <chr>, identified_no <dbl>,
## #   difference <chr>, accepted_name <chr>, …
```

---

# dplyr

## `arrange()`

```r
arrange(carnivores, max_ma, lng) %>%
  select(max_ma, lng, everything())
```

```
## # A tibble: 12,168 x 118
##    max_ma   lng occurrence_no record_type reid_no flags
##     <dbl> <dbl>         <dbl> <chr>         <dbl> <lgl>
##  1 0.0117 -177.        653974 occ              NA NA
##  2 0.0117 -177.        653975 occ              NA NA
##  3 0.0117 -177.        653976 occ              NA NA
##  4 0.0117 -176.       1447238 occ              NA NA
##  5 0.0117 -173.        819689 occ              NA NA
##  6 0.0117 -173.
819690 occ NA NA
##  7 0.0117 -170.       1310863 occ              NA NA
##  8 0.0117 -170.       1310864 occ              NA NA
##  9 0.0117 -170.       1310865 occ              NA NA
## 10 0.0117 -170.       1310851 occ              NA NA
## # … with 12,158 more rows, and 112 more variables:
## #   collection_no <dbl>, identified_name <chr>,
## #   identified_rank <chr>, identified_no <dbl>,
## #   difference <chr>, accepted_name <chr>, …
```

---

# dplyr

## `desc()`

```r
arrange(carnivores, max_ma, desc(lng)) %>%
  select(max_ma, lng, everything())
```

```
## # A tibble: 12,168 x 118
##    max_ma   lng occurrence_no record_type reid_no flags
##     <dbl> <dbl>         <dbl> <chr>         <dbl> <lgl>
##  1 0.0117  180.       1429298 occ              NA NA
##  2 0.0117  180.       1268080 occ              NA NA
##  3 0.0117  178.       1447237 occ              NA NA
##  4 0.0117  177.        653978 occ              NA NA
##  5 0.0117  177.        805370 occ              NA NA
##  6 0.0117  176.       1318554 occ              NA NA
##  7 0.0117  176.       1429299 occ              NA NA
##  8 0.0117  176.       1447239 occ              NA NA
##  9 0.0117  175.       1268186 occ              NA NA
## 10 0.0117  175.       1447240 occ              NA NA
## # … with 12,158 more rows, and 112 more variables:
## #   collection_no <dbl>, identified_name <chr>,
## #   identified_rank <chr>, identified_no <dbl>,
## #   difference <chr>, accepted_name <chr>, …
```

---
class:inverse, mline, center, middle

# The pipe

<br>

## Passes the result of one function to another function

---

# The Pipe

```r
# Each step takes the result of the previous step as its input
carnivores1 <- arrange(carnivores, max_ma)
carnivores2 <- filter(carnivores1, max_ma > 7)
carnivores3 <- mutate(carnivores2, age_range = abs(max_ma - min_ma))
carnivores3
```

--

<html> <div style='float:left'></div> <hr color='#EB811B' size=1px width=720px> </html>

```r
mutate(
  filter(
    arrange(carnivores, max_ma),
    max_ma > 7
  ),
  age_range = abs(max_ma - min_ma)
)
```

--

<html> <div style='float:left'></div> <hr color='#EB811B' size=1px width=720px> </html>

<br>

```r
arrange(carnivores, max_ma) %>%
  filter(max_ma > 7) %>%
  mutate(age_range = abs(max_ma - min_ma))
```

---

# The Pipe

## Insert with **`ctrl/cmd + shift + m`**

---

# The main verbs of *dplyr*

## `select()`

## `filter()`

## `mutate()`

## `arrange()`

## <span style = 'color:#E69F00'><code>summarize()</code></span> = <span style = 'color:#56B4E9'>Summarize the data</span>

## <span style = 'color:#E69F00'><code>group_by()</code></span> = <span style = 'color:#56B4E9'>Group the data</span>

---

# dplyr

## `summarize()`

```r
summarize(<DATA>, <NAME> = <FUNCTION>)
```

---

# dplyr

## `summarize()`

```r
summarize(carnivores, mean_fad = mean(max_ma))
```

```
## # A tibble: 1 x 1
##   mean_fad
##      <dbl>
## 1     10.8
```

--

<html> <div style='float:left'></div> <hr color='#EB811B' size=1px width=720px> </html>

<br>

```r
summarize(carnivores, mean_fad = mean(max_ma), sd_fad = sd(max_ma))
```

```
## # A tibble: 1 x 2
##   mean_fad sd_fad
##      <dbl>  <dbl>
## 1     10.8   12.7
```

---

# dplyr

## `group_by()`

```r
group_by(<DATA>, <VARIABLE>)
```

---

# dplyr

## `group_by()`

```r
carnivores %>%
  group_by(accepted_rank)
```

```
## # A tibble: 12,168 x 118
## # Groups:   accepted_rank [12]
##    occurrence_no record_type reid_no flags collection_no
##            <dbl> <chr>         <dbl> <lgl>         <dbl>
##  1        117266 occ              NA NA             9070
##  2        137493 occ              NA NA            11601
##  3        137495 occ              NA NA            11601
##  4        138737 occ              NA NA            11798
##  5        138738 occ              NA NA            11798
##  6        138739 occ              NA NA            11798
##  7        138740 occ              NA NA            11798
##  8        138741 occ              NA NA            11798
##  9        138742 occ              NA NA            11798
## 10        138743 occ              NA NA            11798
## # … with 12,158 more rows, and 113 more variables:
## #   identified_name <chr>, identified_rank <chr>,
## #   identified_no <dbl>, difference <chr>,
## #   accepted_name <chr>, accepted_attr <lgl>, …
```

---

# dplyr

## `group_by()`

```r
carnivores %>%
  group_by(accepted_rank) %>%
  summarise(n = n(), mean_fad = mean(max_ma))
```

```
## # A tibble: 12 x 3
##    accepted_rank      n mean_fad
##    <chr>          <int>    <dbl>
##  1 family           913   13.4
##  2 genus           3060   10.4
##  3 infraorder         7   20.5
##  4 order            172   20.8
##  5 species         7637   10.4
##  6 subfamily        277   13.1
##  7 subgenus           5    2.28
##  8 suborder           8   34.5
##  9 subspecies        27    0.467
## 10 superfamily        8   23.0
## 11 tribe             26    7.87
## 12 unranked clade    28   15.0
```

---

# dplyr

## `group_by()`

```r
carnivores %>%
  group_by(accepted_rank) %>%
  mutate(n = n(), mean_fad = mean(max_ma)) %>%
  select(n, mean_fad)
```

```
## # A tibble: 12,168 x 3
## # Groups:   accepted_rank [12]
##    accepted_rank     n mean_fad
##    <chr>         <int>    <dbl>
##  1 species        7637     10.4
##  2 species        7637     10.4
##  3 species        7637     10.4
##  4 species        7637     10.4
##  5 genus          3060     10.4
##  6 subfamily       277     13.1
##  7 genus          3060     10.4
##  8 species        7637     10.4
##  9 genus          3060     10.4
## 10 genus          3060     10.4
## # … with 12,158 more rows
```

---
class: inverse, mline, middle, center

# Joins

---

# dplyr

## Joining data

### Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets

### Use `semi_join()` or `anti_join()` to filter datasets against each other

---

# Joins

.center[]

---

# Reading list

#### [Here package](https://github.com/jennybc/here_here), intro to the `here` package.

#### [Happy git with R](https://happygitwithr.com/), how to use version control with Git/GitHub as an R user.

#### [R for Data Science](https://r4ds.had.co.nz/index.html), best intro to the tidyverse.

#### [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/), good working habits by Jenny Bryan.

#### [Modern Dive](https://moderndive.com/foreword.html), statistical inference via the tidyverse.

#### [Cheatsheets](https://www.rstudio.com/resources/cheatsheets/), a compilation of cheatsheets for various packages.

---
class:inverse, mline, center, middle

# It's your turn

---

# Task

- read the [Introduction to ggplot2 (Part 1)](https://gregor-mathes.netlify.app/2020/06/29/introduction-to-ggplot2-part-1/)
- follow along with the exercises in `exercise_ggplot.R`

.center[<img src="https://ak2.picdn.net/shutterstock/videos/15323182/thumb/8.jpg" alt="Allison Horst's visualisation of the here function" width="650"/>
]
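
---

# Joins: a small sketch

The join verbs from the Joins slides, tried on two toy tibbles. This is only a sketch: `occurrences`, `families` and all values in them are made up for illustration and are not part of the carnivores data.

```r
library(dplyr)

# Hypothetical toy data (invented values, for illustration only)
occurrences <- tibble(
  genus  = c("Lutra", "Ursus", "Canis"),
  max_ma = c(11.6, 5.3, 2.6)
)
families <- tibble(
  genus  = c("Lutra", "Ursus", "Felis"),
  family = c("Mustelidae", "Ursidae", "Felidae")
)

# Mutating joins: add the columns of `families` to `occurrences`
left_join(occurrences, families, by = "genus")   # all rows of occurrences
inner_join(occurrences, families, by = "genus")  # only genera found in both
full_join(occurrences, families, by = "genus")   # all rows of both tables

# Filtering joins: keep or drop rows of `occurrences`, add no columns
semi_join(occurrences, families, by = "genus")   # genera with a match
anti_join(occurrences, families, by = "genus")   # genera without a match
```

Naming the key with `by = "genus"` is optional here, but it avoids the message dplyr prints when it has to guess the joining column.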